This technical report 1) reviews the literature on bootstrapping estimation procedures and potential applications to the
selection of air traffic control specialists (ATCSs), 2) describes an empirical demonstration of procedures for estimating the
sample size required to demonstrate criterion-related validity in ATCS selection, and 3) provides summary guidelines and
recommendations for estimating sample size requirements in ATCS selection test validation using bootstrapping
procedures under conditions of direct and indirect range restriction. Bootstrapping estimates the sampling distribution of a
statistic by iteratively resampling cases from a set of observed data. Confidence intervals are constructed for the statistic,
providing an empirical basis for inferential statements about the likely magnitude of the statistic. Correlations between
scores on the written ATCS aptitude test battery and subsequent performance in initial qualification training for a large
sample of 10,869 controllers hired between 1986 and 1992 were bootstrapped in an empirical demonstration of the
methodology. Finally, a three-step sequence of procedures is described for use in future bootstrap estimates of confidence
intervals. Recommendations for sample size requirements in future ATC criterion validity studies include:

1. Results suggest samples of at least N = 175 to ensure the 90% confidence interval for r_xy does not contain 0.

2.  Assumptions  of  bivariate  normality  in  traditional  parametric  estimation  procedures  are  not  justified  in  the  current 
data.  Note  that  this  observation  may  result  in  confidence  intervals  that  are  wider  or  narrower  for  any  given  sample  size 
than  intervals  obtained  from  traditional  parametric  estimation. 

3.  Corrections  for  direct  range  restriction  did  not  substantively  influence  whether  the  bootstrapped  90%  confidence 
interval  contained  0.  Future  applications  should  assess  whether  this  holds  true. 

4. Given the apparent absence of bivariate normality in the current data, similar bootstrapping procedures should be
used to assess whether the 90% confidence intervals for ρ - ρ0 and R_Y.X1X2 - R_Y.X1 contain 0.

Overall,  the  results  suggest  that  bootstrapping  of  validity  coefficients  in  controller  selection  research  may  be  technically 
feasible.  However,  legal  considerations  may  limit  practical  use  of  the  methodology  until  accepted  professional  guidelines, 
standards,  and  principles  are  revised  to  accommodate  innovative  methodologies. 
GLOSSARY OF STATISTICAL SYMBOLS


B = Number of bootstrap iterations
H0 = Null hypothesis
n = Bootstrap sample size
N = Population size
r_b = Correlation computed on bootstrap sample
r_xy = Sample estimate of ρ
r_c = Estimate of ρ corrected for restriction in range on X
r_xy = Estimate of ρ derived from the range-restricted sample
R_Y.X1 = Estimate of regression of X1 on Y
R_Y.X1X2 = Estimate of regression of X1 and X2 on Y
ρ = Population or “true” Pearson product-moment correlation
SE_r = Standard error of the correlation
σ = Standard deviation of a score in sample
s²_r = Asymptotic variance of ρ
s_x = Standard deviation of X in the population
s'_x = Standard deviation of X in the range-restricted sample
t_α/2 = Critical value of t in two-tailed test at the desired confidence level
X = Predictor score
X̄ = Mean predictor score
Y = Criterion score
Ȳ = Mean criterion score


Guidelines  for  Bootstrapping  Validity  Coefficients 
In  ATCS  Selection  Research 


INTRODUCTION 

The Air Traffic Control Specialist (ATCS) occupation is the single largest (about 17,000 persons)
and most publicly visible occupational group in the Federal Aviation Administration (FAA). Air traffic
controllers are at the heart of a web of radars, computers, and communication facilities that comprise an
increasingly complex and busy air transportation system. Competitive examinations have been used to
determine entry into the occupation since 1964 (Brokaw, 1984). Validation of these competitive
examinations has traditionally relied on concurrent, criterion-related designs with substantial samples of
incumbent controllers. For example, about 800 incumbent controllers were sampled from 15 major
cities in an early 1972 validation study. A subsequent longitudinal study drew data from over 2,300
controllers (Sells, 1984). The written test battery used between 1981 and 1992 for the selection of
controllers was validated on samples ranging in size from 900 to over 3,000 (Boone, 1979). More recently,
samples of 438 controller trainees and 296 incumbent controllers were used in predictive and concurrent
criterion-related validation studies of a new generation of computer-administered tests for the occupation
(Broach & Brecht-Clark, 1993).

Use of such large samples in validation studies imposes significant operational and financial burdens
on the agency. For example, rearrangement of work schedules in field facilities is often required to
allow controllers to participate in the studies and to ensure appropriate coverage of control positions.
Consequently, overtime costs may be incurred by the facility to ensure adequate staffing during data
collection efforts. Other incurred costs include (1) direct travel costs to bring the controller to the test
site or the test to the controller, and (2) salary costs for the participating controllers. More efficient
designs that require fewer controllers for selection test validation research are needed by the FAA to
reduce the resource costs associated with validation of controller selection tests.


One possible approach is to maximize the information gained from a single sample of controllers
using innovative, emerging statistical techniques, such as bootstrapping, to estimate the population
validity coefficient for new selection tests. The fundamental task in criterion-related selection test
validation is to make a probability-based inference about the magnitude of the “true” population validity
coefficient, ρ, for a predictor, on the basis of a sample statistic, r_xy, computed on a sample of applicants or
incumbents. Bootstrapping estimates the sampling distribution of a statistic by iteratively resampling
cases, with replacement, from a set of observed data, and computing the sample statistic. Confidence
intervals about the sample statistic can then be constructed, providing an empirical basis for inferential
statements about the likely magnitude of the statistic. This approach allows for use of smaller samples for
estimation of the underlying population parameter. Application of bootstrapping to estimation of
validity coefficients might allow the FAA to use smaller samples in validation studies, thereby reducing
resource costs.

Bootstrapping has been applied to the estimation of validity coefficients in methodological studies
(Cooil, Winer, & Rados, 1987; Kromery & Hines, 1995). Other methodological studies have
investigated the relationship of restriction in range and sample size to the accuracy of the confidence interval
about a bootstrapped statistic (Allen & Dunbar, 1990; Mendoza, Hart, & Powell, 1991). However,
bootstrapped estimates of criterion-related validity coefficients have not appeared in applied studies of
personnel selection tests. Moreover, no guidelines or tables relating effect size (e.g., the magnitude of
r_xy), inferential errors (e.g., Type I and II errors), statistical power, and sample size have appeared for
use by practitioners in designing selection test criterion-related validation studies with bootstrapping in
mind. The purpose of this study was to develop empirically based guidelines and recommendations
for estimating sample sizes required to attain reasonable and stable bootstrapped estimates of validity
coefficients in concurrent, criterion-related validation of ATCS aptitude tests under conditions of
explicit and incidental restriction in range.

TECHNICAL  BACKGROUND 

Traditional  parametric  estimation  of  sample 
size  requirements 

Correlation coefficient. The Pearson product moment correlation coefficient, ρ, reflects the strength
of the linear relationship between two variables (Galton, 1888). A sample estimate of ρ is derived
using the following formula:

$$r_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} \qquad \text{(Equation 1)}$$

Where

r_xy = sample estimate of ρ
X = predictor score
X̄ = mean predictor score
Y = criterion score
Ȳ = mean criterion score
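
As an illustration, the deviation-score form of the Pearson correlation can be sketched in a few lines of Python. The function name `pearson_r` and the five pairs of scores are hypothetical and are not taken from the ATCS data; the sketch assumes paired lists of equal length with non-zero variance.

```python
import math


def pearson_r(x, y):
    """Sample Pearson product-moment correlation between paired scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two cases")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical predictor (X) and criterion (Y) scores for five examinees.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(round(pearson_r(x, y), 2))  # 0.8
```

For these made-up scores the cross-product sum is 8 and both sums of squares are 10, so the correlation is 8/10 = .80.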


Brogden (1949) demonstrated 50 years ago that the economic utility of a personnel selection system
is a direct function of the strength of the predictor-criterion (X-Y) relationship. More recently, Russell,
Colella, and Bobko (1993) found that, if r increases by a factor of 2, gross value-added to the organization
doubles. Hence, an accurate sample estimate of the population parameter ρ provides key insight into
how well a personnel selection system is working and its utility to the organization.

However, this raises the question, “How large does the sample have to be to ensure r_xy is an ‘accurate’
estimate of ρ?” Traditional means of answering this question use parametric assumptions about
distributional characteristics of X and Y. For example, Fisher (1915, 1970, p. 194) described the asymptotic
variance of a correlation (s²_r) between two bivariate normally distributed variables X and Y as:

$$s_r^2 = \frac{(1 - \rho^2)^2}{N} \qquad \text{(Equation 2)}$$

Where

N = sample size.

A minor variation of this formula yields the estimate of standard error (SE_r) used in the
denominator of t-tests of H0: ρ_xy = 0, or:

$$SE_r = \sqrt{\frac{1 - r_{xy}^2}{N - 2}} \qquad \text{(Equation 3)}$$

The estimate for standard error of r_xy permits derivation of confidence intervals. The confidence
interval (CI) is defined as

$$CI = r_{xy} \pm t_{\alpha/2}(SE_r) \qquad \text{(Equation 4)}$$

Where

t_α/2 = the critical value of t in a two-tailed test at the desired confidence level (α = .10, t = 1.645 for the
90% confidence interval, or α = .05, t = 1.96 for the 95% confidence interval).

For example, the SE would be .07 for a sample of N = 200 and a correlation of .20 between two bivariate
normally distributed variables. One could be 90% sure the true population parameter ρ will fall in the
interval .20 ± 1.645(.07), or .08 to .32. Similarly, one would be 95% sure the true population ρ will fall in
the interval .20 ± 1.96(.07), or .06 to .34. In this example, the 90% and 95% confidence intervals do
not include zero, and one could reasonably infer the population correlation coefficient was not zero.
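
The arithmetic in this example can be checked with a short Python sketch. It assumes that Equation 3 has the form SE = sqrt((1 - r²)/(N - 2)) and that Equation 4 adds and subtracts the critical value times SE; the function names `se_r` and `parametric_ci` are illustrative only.

```python
import math


def se_r(r, n):
    """Standard error of r for a sample of n cases (assumed form of Equation 3)."""
    return math.sqrt((1 - r ** 2) / (n - 2))


def parametric_ci(r, n, t_crit):
    """Interval r +/- t_crit * SE (assumed form of Equation 4)."""
    half_width = t_crit * se_r(r, n)
    return r - half_width, r + half_width


se = se_r(0.20, 200)
lo90, hi90 = parametric_ci(0.20, 200, 1.645)  # 90% interval
lo95, hi95 = parametric_ci(0.20, 200, 1.96)   # 95% interval
print(f"SE = {se:.3f}")
print(f"90% CI = ({lo90:.3f}, {hi90:.3f})")
print(f"95% CI = ({lo95:.3f}, {hi95:.3f})")
```

Because the sketch carries the unrounded standard error (about .0696) through the calculation, its end points can differ in the third decimal from those obtained with SE rounded to .07.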

Assume the organization knows in advance that minimally acceptable criterion-related validity must
be r_xy = .20 for a new personnel selection test to add economic value. One could then work backwards to
determine the minimum sample size needed to ensure zero does not fall in the confidence interval (i.e.,
the null hypothesis H0: ρ = 0 can be rejected at p(Type I error) < .10) if in fact ρ = .20. For example,
the minimum sample size needed to detect r = .20 between two bivariate normal variables at p < .10
(2-tailed) can be obtained by solving for N as follows:

$$.20 = 1.645\sqrt{\frac{1 - .20^2}{N - 2}}, \quad \text{or } N = 67 \text{ for } \alpha = .10$$
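
Solving that expression for N gives N = 2 + (1 - r²)(t/r)², which can be sketched as follows. The function name `min_sample_size` is hypothetical, and rounding up to the next whole case is an assumption about how the fractional result is treated.

```python
import math


def min_sample_size(r, t_crit=1.645):
    """Smallest whole N at which r equals t_crit standard errors.

    Rearranges r = t_crit * sqrt((1 - r**2) / (N - 2)) for N.
    """
    n_exact = 2 + (1 - r ** 2) * (t_crit / r) ** 2
    return math.ceil(n_exact)


print(min_sample_size(0.20))   # 67
print(min_sample_size(0.182))  # 81
```

The second call uses the observed ATCS validity of .182 and reproduces the parametric estimate of N = 81 cited in the bootstrap procedures.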


Note,  the  median  sample  size  of  criterion  validity 
studies  reported  in  the  Journal  of  Applied  Psychology 
and  Personnel  Psychology  between  1965  and  1991  was 
N=  104  (Russell  et  al.,  1994). 
Restriction in range. However, the distribution of predictor scores is generally non-normal in
concurrent, criterion-related validity studies due to range restriction, as the predictor has been used to
select the incumbents. For example, applicants to the ATCS occupation competed under civil service rules
on the basis of a composite of written aptitude test scores (Broach, 1998). The distribution of that
composite score for 205,592 ATCS applicants (out of over 400,000 since 1981) is presented in Figure 1. The
distribution of that composite score for the 10,869 applicants competitively selected into the FAA
between 1986 and 1992 is illustrated in Figure 2. Both figures superimpose what a normal curve with the
same mean and standard deviation as data contained in the graph would look like. Note both distributions
are distinctly non-normal and negatively skewed; the distribution of applicant composite scores (Figure 1)
evidences some degree of bi-modality. The correlation between the composite score and subsequent
performance in FAA Academy initial ATCS training for the 10,869 competitive entrants was r = .182.
However, the “true” population validity (ρ) is likely to be much larger than .182 in the N = 205,592
applicant population.

Ghiselli (1964) derived a correction formula that yields a more accurate estimate of ρ, i.e., what would
have been expected if predictor and criterion data had been available on all applicants (Bobko & Rieck,
1980; Linn, Harnisch, & Dunbar, 1981). The formula correcting for direct range restriction is:

$$r_c = \frac{r_{xy}\left(\frac{s_x}{s'_x}\right)}{\sqrt{1 - r_{xy}^2 + r_{xy}^2\left(\frac{s_x}{s'_x}\right)^2}} \qquad \text{(Equation 5)}$$

Where

r_c = the estimate of ρ corrected for range restriction on X
r_xy = the estimate of ρ derived from the range restricted sample
s'_x = the standard deviation of X in the range restricted sample
s_x = the standard deviation of X in the non-range restricted population

Application of this formula to the correlation between ATCS aptitude composite score and FAA
Academy performance of r_xy = .182 yields:

$$r_c = \frac{.182\left(\frac{14.11}{5.02}\right)}{\sqrt{1 - .182^2 + .182^2\left(\frac{14.11}{5.02}\right)^2}} = \frac{.512}{1.23} = .42$$
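
A minimal sketch of the correction follows, assuming the standard form of the direct range restriction correction with a square root in the denominator. The function and argument names are illustrative and do not come from the report. Evaluated this way, the inputs quoted above (r_xy = .182, s_x = 14.11, s'_x = 5.02) give roughly .46; the .42 shown in the text is the value obtained when .512 is divided by the radicand 1.23 rather than by its square root, so it is worth confirming which figure applies before reusing it.

```python
import math


def correct_direct_range_restriction(r_xy, sd_unrestricted, sd_restricted):
    """Correct r_xy for direct range restriction on the predictor X.

    Assumes the standard formula:
        r_c = r*u / sqrt(1 - r**2 + r**2 * u**2),  u = sd_unrestricted / sd_restricted
    """
    u = sd_unrestricted / sd_restricted
    return (r_xy * u) / math.sqrt(1 - r_xy ** 2 + (r_xy ** 2) * (u ** 2))


# Values quoted in the text: observed r, applicant SD, and selected-sample SD.
r_c = correct_direct_range_restriction(0.182, 14.11, 5.02)
print(f"corrected r = {r_c:.3f}")
```

When the two standard deviations are equal (no restriction), the function returns the uncorrected correlation, which is a quick sanity check on the implementation.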


Assumptions about the underlying distributions of X and Y. While direct range restriction on the
predictor X constitutes a known violation of bivariate normality that can be “corrected” for, the X and Y
distributions may be non-normal for any one of a large number of other reasons. Highly skewed (e.g.,
Figure 1) or multi-modal distributions of the predictor or criterion cause Fisher’s bivariate normality
assumption to be violated. For example, Fisher’s formula assumes the sample was drawn from a single
population characterized by a single value of ρ. Unfortunately, if the independent and dependent
variables are distinctly non-normal, as appears to be the case in ATCS aptitude test scores, “tests ... based on
the large sample formula are often very deceptive” (Fisher, 1970, p. 195).

Moreover, it is possible that ATCS applicants were drawn from multiple populations, each with its own
unique value of ρ. For example, Russell and Dean (1997) reported evidence of multiple population
values of ρ in a sample of N > 15,000 applicants hired using the General Aptitude Test Battery (GATB)
over a 10-year period. ATCS applicants were recruited from diverse demographic and geographic
groups; it is possible that unique values of ρ might characterize the population of applicants from rural
areas with just high school diplomas compared to the sub-population of applicants from large cities with
college degrees. If applicants were drawn from multiple populations with unique ρ, using Fisher’s formula
to estimate confidence intervals and the sample sizes required to attain them will be incorrect.
In sum, correction formulae can be derived when deviations from bivariate normality are well
understood, e.g., in the case of direct range restriction in the predictor. Unfortunately, when the
distributional characteristics of X and Y are unknown, the underlying distributional characteristics of r_xy
are also unknown, and Equation 2 cannot be used to estimate required sample size.

Bootstrap  Estimation  Procedures 

Recently, Efron (1979) presented a new method of empirically estimating characteristics of population
distributions from sample data, called bootstrapping. Bootstrapping estimates the sampling distribution of
a statistic by iteratively resampling cases from a set of observed data. Basically, B “bootstrap” samples of
size n are taken with replacement from the original sample of size N and saved to a file. An investigation
using B = 1,000 bootstrap samples of size n will essentially be able to approximate the actual sampling
distribution that would have been obtained if multiple independent samples of size N were drawn from
the population. Bootstrapping is computationally time intensive, as the sample at hand is resampled
with replacement many times to derive the distribution of the statistic of interest.

There are many advantages to using the bootstrap technique. First, it is not restricted to the normality
assumptions of parametric tests. The percentile bootstrapping method (Efron & Tibshirani, 1993,
chapter 13) generates confidence intervals directly from the bootstrapped sampling distribution (e.g., if
B = 1,000 bootstrap samples are taken, the bootstrap correlations (r_b) representing the 5th and 95th
percentile points would form the lower and upper points of a 90% CI). Of interest in this application is
graphical interpretation of r_b frequency distributions (Efron & Tibshirani, 1993). Evidence of
multi-modality would suggest the presence of multiple sub-populations in the sample, each with a unique ρ.
Second, information concerning the form of the original sample is retained, with no loss of
distributional information. Rasmussen (1987) noted such loss of information does occur when nonparametric
techniques convert data to ranks, which is why Lunneborg (1985) described bootstrapping as falling
between parametric and nonparametric procedures for making probabilistic inferences.


The main disadvantage of the technique is that one must be confident the sample examined is indeed
representative of the population from which it was drawn. Other than differences due to direct range
restriction (which can be corrected for), this assumption appears to be met for studies of controller
selection due to the large sample sizes. Regardless, applications of parametric statistical estimation
procedures effectively make the same assumption. For example, if the test statistic of interest falls in the
critical region, the investigator rejects the null hypothesis and proceeds to draw implications for theory
and practice as if what is true in the sample is also true in the population. All inferential statistics must make
the basic assumption that evidence drawn from the sample (e.g., rejection or lack of rejection of H0)
generalizes to the population.

An example. Rasmussen (1987) presented the following simple example to explain the bootstrap
procedure. The computer initially is presented with a data set containing 10 graduate students’ first year
grade point average (GPA) and Graduate Record Exam (GRE) scores. Then a bootstrap sample (B_1) is
randomly drawn with replacement from these 10 observations, so that some observations may be
represented more than once in the bootstrap sample while other observations are not included.
A single bootstrap sample may include the following cases: 5, 2, 8, 6, 2, 7, 9, 6, 1, and 2, resulting in a
correlation of r_b1 = .59. This procedure is repeated a large number of times (e.g., B = 1,000) and each r_b is
saved to a separate file. The bootstrap correlations (r_b) are then rank ordered with the 50th and 950th
correlations representing 90% confidence interval end points. The null hypothesis of ρ_GPA,GRE = 0 is
tested by determining whether 0 falls within the confidence interval (Rasmussen, 1987). The mean
value of r_b across all B = 1,000 bootstrap samples would be the best estimate of ρ_GPA,GRE.
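
The procedure in this example can be sketched in Python as shown below. The GRE and GPA values are hypothetical stand-ins (Rasmussen's data are not reproduced here), the function names are illustrative, and the fixed random seed is an assumption made only so the run is repeatable. Redrawing the rare resample that has no variance is also a simplification of this sketch, not part of the published method.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_correlations(x, y, b=1000, seed=0):
    """Return b bootstrap correlations, sorted from smallest to largest."""
    rng = random.Random(seed)
    n = len(x)
    r_values = []
    while len(r_values) < b:
        # Resample case numbers with replacement; a case may repeat or be absent.
        cases = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in cases]
        ys = [y[i] for i in cases]
        # With only 10 cases a resample can have zero variance; redraw it.
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        r_values.append(pearson_r(xs, ys))
    return sorted(r_values)


def percentile_interval(sorted_r, lower_pct=5, upper_pct=95):
    """Percentile end points, e.g. the 50th and 950th of 1,000 sorted values."""
    b = len(sorted_r)
    lower = sorted_r[max(b * lower_pct // 100, 1) - 1]
    upper = sorted_r[max(b * upper_pct // 100, 1) - 1]
    return lower, upper


# Hypothetical GRE and first-year GPA scores for 10 graduate students.
gre = [520, 610, 580, 700, 640, 560, 690, 600, 730, 550]
gpa = [3.0, 3.4, 3.1, 3.8, 3.3, 3.2, 3.6, 3.5, 3.9, 2.9]

r_b = bootstrap_correlations(gre, gpa, b=1000)
ci_low, ci_high = percentile_interval(r_b)
print(f"mean r_b = {sum(r_b) / len(r_b):.2f}")
print(f"90% CI = ({ci_low:.2f}, {ci_high:.2f})")
print("H0 rejected" if not (ci_low <= 0 <= ci_high) else "H0 not rejected")
```

Sorting the correlations once and indexing into the sorted list mirrors the rank-ordering step described in the text, so the 90% end points are simply the 50th and 950th values.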

Issues in bootstrapping. To determine the bootstrap’s appropriateness, studies have been conducted
examining the similarity in results between the bootstrap and traditional statistical approaches
under conditions in which the parametric assumptions were met (e.g., Diaconis & Efron, 1983; Efron,
1985, 1986; Lunneborg, 1985). These studies resulted in bootstrap statistics (e.g., estimates of
confidence intervals) that were extremely close to those generated from traditional parametric approaches.
Bickel and Freedman (1981; Freedman, 1981) demonstrated the bootstrap was asymptotically valid for
many statistics (e.g., t and regression statistics).

A number of issues remain unresolved in using bootstrapping to conduct hypothesis testing. Most of
these issues revolve around the relative accuracy of parametric versus bootstrap procedures in estimating
probability intervals at the extreme tails of known (i.e., normal) distributions. However, the percentile
method of estimating confidence intervals, as described by Efron and Tibshirani (1993), provides
“good theoretical coverage properties as well as reasonable stability in practice” (p. 169). Good
“theoretical coverage” refers to confidence intervals that 1) accurately estimate the probability of the
population parameter falling within the confidence interval and 2) divide “coverage error” equally across
the two tails.

Hence, the percentile bootstrap method of estimating confidence intervals might be used to
estimate the distributional characteristics of r_xy under actual conditions faced by the FAA in the selection of
air traffic controllers. In this study, bootstrap procedures were applied to archival ATCS selection data to
estimate sampling distributions for r_xy obtained with B = 1,000 samples of n = 25, 50, 75, ..., 200. Results
are presented with and without correction for direct range restriction.

METHOD 

Sample 

The Civil Aeromedical Institute provided archival ATCS written aptitude test scores for 205,592
examinations for the period 1981 to 1992. The Institute also provided test and criterion data for the
10,869 persons competitively selected into the ATCS occupation from October 1985 through January 1992.

Measures 

Predictor. The written ATCS aptitude test battery consisted of three tests: (a) the Multiplex Controller
Aptitude Test (MCAT); (b) the Abstract Reasoning Test (ABSR); and (c) the Occupational Knowledge
Test (OKT). The MCAT was a timed, 110-item paper-and-pencil civil service test (OPM test No.
510) simulating activities required for control of air traffic. Multiple, parallel forms of this test were
available (Lilienthal & Pettyjohn, 1981). Aircraft locations and direction of flight were indicated with
graphic symbols on a simplified, simulated radar display (Figure 3). An accompanying table provided
relevant information required to answer the item, including aircraft altitudes, speeds, and planned routes
of flight. MCAT test items required examinees to identify situations resulting in conflicts between
aircraft, to interpret tabular and graphical information, and to solve time, speed, and distance problems. The
ABSR was a timed, multiple-choice, 50-item civil service examination (OPM test No. 157). To solve an
item, examinees determined what relationships existed within sets of symbols or letters. The examinee
then identified the next symbol or letter in the progression, or the element missing from the set. A
sample ABSR item is presented in Figure 4. The OKT was a timed, multiple-choice 80-item job knowledge
test that contained items related to seven knowledge domains relevant to aviation, generally, and to air
traffic control phraseology and procedures, specifically. The OKT was developed as an alternative to
self-reports of aviation and air traffic control experience. The OKT was found to be more predictive of
performance in ATCS training than self-reports (Dailey & Pickrel, 1984; Lewis, 1978).

The development of the written ATCS aptitude test battery has been extensively described elsewhere
(Brokaw, 1984; Collins, Boone, & VanDeventer, 1984; Manning, 1991; Sells, 1984; Sells, Dailey, &
Pickrel, 1984). The test-retest correlation for the MCAT was estimated at .60 in a sample of 617 newly
hired controllers (Rock, Dailey, Ozur, Boone, & Pickrel, 1982, p. 59). Parallel form reliability, as
computed on the same sample, ranged from .42 to .89 for various combinations of items (Rock et al., p.
103). Lilienthal and Pettyjohn (1981) examined internal consistency and item difficulties for 10
versions of the MCAT. Cronbach’s alpha ranged from .63 to .93; the alphas for 7 of the 10 versions were
greater than .80. In contrast, no item analyses, parallel form, test-retest, or internal consistency estimates
of the ABSR test have been reported.

Weighted MCAT and ABSR raw scores were summed and transformed to a score with a mean of 70 and
maximum of 100, known as the Transmuted Composite Score (TMC). No estimates for the reliability of this
composite score have been reported. About half of all applicants were expected to score at or above the mean
(Rock, Dailey, Ozur, Boone, & Pickrel, 1984).
Criterion. The criterion was performance in the FAA Academy initial ATCS training program, known
as the ATCS Non-radar Screen (“the Screen”). Under the Uniform Guidelines for Employee Selection
Procedures (Equal Employment Opportunity Commission, 1978), training may be used as a criterion
measure where success in training is “properly measured,” and the relevance of the training can be
demonstrated through comparison of training content to critical or important job behaviors, or by
showing that training measures are related to subsequent measures of job performance. The Screen was
originally established in response to recommendations made by the U.S. Congress House Committee
on Government Operations (U.S. Congress, 1976) to “...provide early and continued screening to insure
(sic) the prompt elimination of unsuccessful trainees and relieve the regional facilities of much of this
burden” (p. 13). The Screen was based upon a miniaturized training-testing-evaluation personnel
selection model (Siegel, 1978, 1983; Siegel & Bergman, 1975) in which individuals with no prior knowledge
of an occupation are trained and then assessed for their potential to succeed in the job. Performance in
the Screen has been shown to predict subsequent performance in radar based training one to two years
after entry into the occupation (Broach & Manning, 1994), as well as completion of the rigorous
on-the-job training sequence and certification as a qualified “full performance level” (FPL) controller
(Broach, 1998; Della Rocco, 1998; Della Rocco, Manning, & Wing, 1990; Manning, Della Rocco, & Bryant, 1989).

Thirteen  assessments  of  performance,  including 
six  classroom  tests,  observations  of  performance  in 
six  laboratory  simulations  of  non-radar  air  traffic 
control,  and  a  final  written  examination,  were  made 
during  the  Screen  (Broach,  Farmer,  &  Young,  in 
review;  Della  Rocco,  1999;  Della  Rocco,  Manning, 
&  Wing,  1990).  The  final  summed  composite  score 
(NLCOMP)  was  weighted  20%  for  the  classroom 
tests,  60%  for  laboratory  simulations,  and  20%  for 
the  final  examination.  A  minimum  NLCOMP  score 
of  70  was  required  to  pass.  The  final  composite  score 
was  the  criterion  measure  in  this  study. 

Bootstrap  procedures 

The SYSTAT 7.01 statistical package, published by SPSS Inc., was used for all derivations (syntax files
are available from the first author). Each bootstrap procedure reported below followed a four-step
sequence to yield “percentile” confidence intervals, as described by Efron and Tibshirani (1993).

Step 1: Number of iterations. Decide how many bootstrap samples (B) to take. Evidence suggests
bootstrap estimates of common statistics’ distributional characteristics tend to stabilize when the
number of bootstrap samples drawn approaches B = 200 (Efron & Tibshirani, 1993, p. 52). However, point
estimates of confidence interval percentiles (e.g., the 5th and 95th percentiles) are subject to greater error
in estimation. Efron and Tibshirani recommend extracting 500 to 1,000 bootstrap samples to minimize
estimation error (1993, p. 252). Hence, to ensure accuracy, all bootstrap procedures reported here
iteratively drew B = 1,000 bootstrap samples with replacement.

Step 2: Number of sampled observations for
bootstrap. Decide how many observations should be
drawn in each of the B_1 to B_1000 bootstrap samples.
Because the parametric estimate of the sample size required
in the current data for the sample statistic r = .182
to reject H0: ρ = 0 at α = .10 was N = 81 (as derived
from Equation 3), eight independent bootstraps of
n = 25 through n = 200 were performed. In other
words, first, B = 1,000 samples of size n = 25 were
drawn, with replacement, from the original sample of
N = 10,869. Then B = 1,000 samples of size n = 50
were drawn, with replacement, from the original
sample, followed by B = 1,000 samples of size n = 75,
B = 1,000 samples of size n = 100, and so forth until
a total of eight bootstrap operations had been performed
for n = 25, 50, 75, ..., 200.

Step 3: Compute bootstrapped statistic. The
TMC-NLCOMP Pearson product moment correlation
(rb) and TMC standard deviation were derived
for each bootstrap sample (B_1 to B_1000) and saved to a
file labeled TMC25. This procedure was repeated
independently for n = 50, 75, 100, 125, 150, 175, and
200, yielding additional output files labeled TMC50
through TMC200.

Step 4: Examine distribution of bootstrapped
statistic. Correlations (rb) derived from each bootstrap
procedure were sorted, and the values corresponding
to the 5th, 50th, and 95th percentiles were identified.
The frequency with which each rb value occurred was
then plotted, with the 5th, 50th, and 95th
percentile values labeled below the X-axis.
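
The four steps above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch, not the SYSTAT syntax used in the study: the data are synthetic stand-ins generated to have a correlation near .182, and the function names (`bootstrap_correlations`, `percentile_interval`) are hypothetical.

```python
import numpy as np


def bootstrap_correlations(x, y, n, B=1000, seed=0):
    """Steps 1-3: draw B samples of size n with replacement and
    return the Pearson r and the SD of x for each sample."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_b = np.empty(B)
    sd_x = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, len(x), size=n)  # resample n cases with replacement
        xs, ys = x[idx], y[idx]
        r_b[b] = np.corrcoef(xs, ys)[0, 1]     # bootstrapped correlation
        sd_x[b] = xs.std(ddof=1)               # saved for later range-restriction correction
    return r_b, sd_x


def percentile_interval(r_b):
    """Step 4: 5th, 50th, and 95th percentiles of the bootstrapped r values."""
    lo, med, hi = np.percentile(r_b, [5, 50, 95])
    return float(lo), float(med), float(hi)


# Synthetic stand-in for the N = 10,869 sample (assumption: not the real data).
rng = np.random.default_rng(42)
N = 10_869
x = rng.normal(91.46, 5.02, N)                  # predictor on a TMC-like scale
z = (x - 91.46) / 5.02
y = 0.182 * z + np.sqrt(1 - 0.182**2) * rng.normal(size=N)  # criterion, r near .182

intervals = {}
for n in range(25, 201, 25):                    # n = 25, 50, ..., 200
    r_b, _ = bootstrap_correlations(x, y, n)
    intervals[n] = percentile_interval(r_b)
    print(n, [round(v, 3) for v in intervals[n]])
```

With real data, `x` and `y` would be replaced by the observed TMC and NLCOMP scores; everything else in the loop stays the same.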

Analyses

Uncorrected correlations. Three analyses were
performed to generate different distributions of rb for
each bootstrap sample size (n = 25, 50, ..., 200). First, a
frequency distribution of rb was plotted and the 5th,
50th, and 95th percentile values for the simple,
uncorrected TMC-NLCOMP correlation were derived.
For comparison purposes, the normal curve
with a mean and standard deviation identical to that
found in the rb frequency distribution was superimposed.
Basic sampling theory predicts that the interval
between the 5th and 95th percentile values of rb
will decrease as sample size increases. The smallest
bootstrap sample size (n) with a 90% confidence
interval that no longer contains 0 will approximate
the minimum sample size (N) needed to ensure r =
.182 will reject H0: ρ = 0 at α = .10. Computational
time required for this procedure ranged from two
hours (bootstrap n = 25) to six hours (bootstrap n =
200) on a 233 MHz Intel Pentium® personal computer.
Graph A-1 in Appendix A portrays the frequency
distribution output for B = 1,000 bootstrap
samples of size n = 25 for the simple, uncorrected
TMC-NLCOMP correlation. Graphs A-2 through
A-8 in Appendix A present the frequency distributions
for rb derived for B = 1,000 bootstrap samples of
n = 50, 75, 100, 125, 150, 175, and 200, respectively.
The logical flow of this analysis is illustrated in
Figure 5.
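
The decision rule described above (take the smallest bootstrap n whose 90% interval excludes 0) can be illustrated with the percentile bounds reported in Table 1. The values below are copied from that table; the function name is hypothetical and the snippet is only a sketch of the rule.

```python
# (n, 5th percentile, 95th percentile) of r_b, copied from Table 1.
UNCORRECTED = [
    (25, -0.251, 0.587), (50, -0.110, 0.476), (75, -0.072, 0.423),
    (100, -0.020, 0.404), (125, -0.008, 0.377), (150, -0.001, 0.368),
    (175, 0.016, 0.346), (200, 0.034, 0.332),
]
CORRECTED = [
    (25, -0.589, 0.898), (50, -0.298, 0.836), (75, -0.198, 0.796),
    (100, -0.055, 0.779), (125, -0.021, 0.753), (150, -0.003, 0.744),
    (175, 0.046, 0.719), (200, 0.094, 0.703),
]


def smallest_n_excluding_zero(rows):
    """Return the smallest n whose 90% percentile interval does not contain 0."""
    for n, lower, upper in sorted(rows):
        if lower > 0 or upper < 0:
            return n
    return None  # no tested sample size excluded 0


smallest_uncorrected = smallest_n_excluding_zero(UNCORRECTED)
smallest_corrected = smallest_n_excluding_zero(CORRECTED)
print(smallest_uncorrected, smallest_corrected)
```

Because only multiples of 25 were bootstrapped, the rule locates the threshold to within one step of 25 cases.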

Correlations corrected for restriction in range.
Second, Ghiselli’s (1964) correction formula for direct
range restriction was applied to each rb within the
TMC25, TMC50, ..., and TMC200 files. The TMC
standard deviation (s'x) for each bootstrap sample was
computed and saved to the file with each respective
rb. Subsequently, each rb was corrected using s'x derived
from the bootstrap sample from which it was
drawn, and sx = 14.11, derived from the N = 206,592
applicant population. The corrected bootstrapped
correlation coefficients (rb) were rank ordered and
plotted, yielding a frequency distribution with the
5th, 50th, and 95th percentile points indicated on
the X-axis. The flow of this analysis is illustrated in
Figure 6. Again, for purposes of comparison, a normal
curve with a mean and standard deviation identical
to that found in the corrected rb frequency
distribution was superimposed on the rb frequency
distribution. Graph B-1 in Appendix B is the result of
this procedure applied to the B = 1,000 bootstrap
samples of size n = 25. Graphs B-2 through B-8 in
Appendix B present the frequency distributions for rb
corrected for restriction in range for B = 1,000 bootstrap
samples of n = 50, 75, 100, 125, 150, 175, and
200, respectively.
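
As an illustration of the correction step, the sketch below assumes the standard formula for direct range restriction on the predictor (often called Thorndike's Case 2), which is taken here to be the form of the Ghiselli (1964) correction; the function name is hypothetical. The example inputs are the full-sample values reported in this study (r = .182, restricted SD = 5.02, unrestricted SD = 14.11), whereas the analysis itself used the SD of each bootstrap sample.

```python
import math


def correct_direct_range_restriction(r, sd_restricted, sd_unrestricted):
    """Correct an observed correlation for direct range restriction on X.

    Assumed form: r_c = r*K / sqrt(1 - r^2 + r^2 * K^2),
    with K = sd_unrestricted / sd_restricted.
    """
    k = sd_unrestricted / sd_restricted
    return r * k / math.sqrt(1.0 - r**2 + (r**2) * (k**2))


# Full-sample values reported in the study, used here only as an example.
r_corrected = correct_direct_range_restriction(0.182, 5.02, 14.11)
print(round(r_corrected, 3))
```

Under this assumed formula the corrected value comes out at roughly .46, in the neighborhood of the corrected medians in Table 1. When the two standard deviations are equal, the function returns r unchanged.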

Correlations generated for bivariate normal population.
Finally, using the SYSTAT 7.01 random normal
function, 1,000 rb were generated from B = 1,000
samples of n = 25 taken from a bivariate normal
population with ρ = .182. The standard deviation
was computed as σ = .1974, based on Equation 2.
These bivariate normal bootstrapped correlation coefficients
were rank ordered and plotted, yielding a
frequency distribution with the 5th and 95th percentile
points indicated on the X-axis, as were the values
ρ = .182 and σ = .1974 used to generate the data.
Graph C-1 in Appendix C presents the result of this
procedure. This procedure was repeated for samples
of n = 50, 75, ..., 200, computing the standard
deviation each time based on Equation 2. The flow of
this analysis is portrayed in Figure 7. Graphs C-2
through C-8 in Appendix C present the frequency
distributions for rb derived for B = 1,000 bootstrap
samples of n = 50, 75, 100, 125, 150, 175, and 200,
respectively.
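
The bivariate normal comparison can be sketched as follows. Two assumptions are made: first, that Equation 2 has the form σ = (1 − ρ²)/√(n − 1), which reproduces the reported σ = .1974 at n = 25; second, that correlations are obtained by drawing n cases from a bivariate normal population and computing r, which is one reading of the procedure described above. The function names are hypothetical.

```python
import numpy as np


def sd_r_equation2(rho, n):
    """Assumed form of Equation 2: large-sample SD of r under bivariate normality."""
    return (1.0 - rho**2) / np.sqrt(n - 1)


def simulate_r(rho, n, B=1000, seed=0):
    """Draw B samples of size n from a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    r = np.empty(B)
    for b in range(B):
        xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        r[b] = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
    return r


rho = 0.182
r_sim = simulate_r(rho, n=25)
lower, upper = np.percentile(r_sim, [5, 95])
print("Equation 2 SD at n = 25:", round(float(sd_r_equation2(rho, 25)), 4))
print("Simulated SD of r:", round(float(r_sim.std(ddof=1)), 4))
print("5th and 95th percentiles:", round(float(lower), 3), round(float(upper), 3))
```

Repeating the call for n = 50, 75, ..., 200 gives the reference distributions against which the bootstrapped distributions can be compared.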

In sum, the graphs in Appendix A capture the
bootstrapped rb frequency distribution for TMC-NLCOMP
correlations uncorrected for range restriction.
The Appendix B graphs capture the
bootstrapped rb frequency distribution for TMC-NLCOMP
correlations corrected for range restriction.
Last, the graphs in Appendix C show what a
bootstrapped rb frequency distribution for TMC-NLCOMP
correlations is expected to look like if the
applicant population were characterized by a bivariate
normal distribution of TMC and NLCOMP and
ρTMC-NLCOMP = .182. Confidence interval end points
for corrected and uncorrected bootstrap procedures
are summarized in Table 1.

RESULTS 

A number of inferences can be drawn from the
graphs and their respective confidence intervals. First,
and perhaps most obvious, visual interpretation suggests
that the distributional characteristics of rb (corrected
or uncorrected for direct range restriction) are
not what would be expected under conditions of
bivariate normality: the “C” graphs differ meaningfully
from the “A” and “B” graphs. Hence, for the
90% confidence interval (α = .10), the estimated
sample size of N = 81 required to detect ρ = .182,
derived under parametric assumptions, is spurious.

Second, examination of the confidence intervals
summarized in Table 1 indicates that N = 175 or greater
is required to ensure the 90% confidence interval
does not contain 0 (i.e., H0: ρ = 0 will be rejected) in
these archival data. The actual distributional characteristics
of these data, as revealed by the bootstrap
procedure, suggest a larger sample (N ≥ 175) will be
required to reject H0: ρ = 0 than would be required if
the joint TMC-NLCOMP space were bivariate normal
(N = 81).
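
One way to see why a larger sample is needed is to compare the width of each bootstrapped 90% interval with the width a normal-theory interval would have at the same n. The sketch below uses the uncorrected bounds from Table 1 and assumes that the parametric standard deviation of r follows σ = (1 − ρ²)/√(n − 1), the form that reproduces the σ values reported for the bivariate normal analysis. It is an illustration of the comparison, not a reanalysis.

```python
from math import sqrt
from statistics import NormalDist

RHO = 0.182
Z90 = NormalDist().inv_cdf(0.95)  # two-sided 90% critical value, about 1.645

# n: (5th percentile, 95th percentile) of uncorrected r_b, copied from Table 1.
UNCORRECTED = {
    25: (-0.251, 0.587), 50: (-0.110, 0.476), 75: (-0.072, 0.423),
    100: (-0.020, 0.404), 125: (-0.008, 0.377), 150: (-0.001, 0.368),
    175: (0.016, 0.346), 200: (0.034, 0.332),
}


def normal_theory_width(rho, n):
    """Width of a symmetric 90% interval under the assumed parametric SD of r."""
    return 2.0 * Z90 * (1.0 - rho**2) / sqrt(n - 1)


comparison = {
    n: (upper - lower, normal_theory_width(RHO, n))
    for n, (lower, upper) in UNCORRECTED.items()
}
for n, (boot_width, normal_width) in comparison.items():
    print(f"n = {n:3d}  bootstrap width = {boot_width:.3f}  normal-theory width = {normal_width:.3f}")
```

Under these assumptions the bootstrapped interval is wider than the normal-theory interval at every tested n, which is consistent with the larger sample size requirement noted above.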

Third, given the relatively tight range of observed
TMC values in the N = 10,869 competitively selected
controllers, virtually no outliers were present.
Bootstrapping procedures are most subject to estimation
error when the original sample contains infrequent,
extreme outliers (Efron & Tibshirani, 1993);
Figure 2 indicates this was not a problem in the
current data.

Fourth, the same cannot be said of TMC values in
the original applicant pool, which suggest a small
group of extremely low TMC values lie some distance
from the rest of the observations. These outliers will
inflate the non-range restricted standard deviation
estimate (sx) in Ghiselli’s (1964) correction formula.
This may have been due to labor pool “history effects”
associated with the Professional Air Traffic
Controller Organization (PATCO) strike of the early
1980s. That is, there may have been a higher than
usual frequency of low-ability, unsuccessful applicants
attracted by the publicity about the ATCS
occupation following the strike. If the outliers were
due to such a history effect, Ghiselli’s correction for
range restriction may represent a spurious overcorrection
when estimating ρ in future applicant pools
that are not influenced by a similar history effect.

Finally, this last caveat holds for all
inferences drawn from the current analyses: these
results will generalize to future criterion validation
efforts only to the extent that similar TMC-NLCOMP
distributional characteristics and latent TMC-NLCOMP
relationships exist.

Guidelines  and  Recommendations 

A number of recommendations can be drawn for
future FAA efforts at estimating criterion validity.
First, extremely large (1,000+) sample sizes are not
required to yield accurate estimates of selection battery
criterion validity. Results suggest samples in the
range of N = 200-500 ought to provide whatever
margin of error might be needed to ensure accurate
estimation of ρ, i.e., to ensure 0 does not fall in the
90% confidence interval. A number of additional
recommendations and guidelines follow:

1. Assumptions of bivariate normality in traditional
parametric estimation procedures are not justified
in the current data. Estimation of confidence
intervals and tests of null hypotheses should be
performed using the four-step bootstrap procedure
outlined above. Note that this recommendation
may result in confidence intervals that are
larger or smaller than those obtained from traditional
parametric estimation for any given
sample size.

2. Corrections for range restriction did not substantively
influence whether the bootstrap estimated
90% confidence interval contained 0. Future
applications should continue to assess whether this
holds true. Note that, under parametric assumptions,
the estimate of the standard deviation of rc is:


$$ SD(r_c) = \frac{s_x / s'_x}{\left[ 1 - r'^2 + r'^2 \left( s_x^2 / s'^2_x \right) \right]^{3/2}} \, SD(r') $$    Equation (6)

Where

sx = the standard deviation of X in the unrestricted population
s'x = the standard deviation of X in the range restricted sample
s²x = the variance of X in the unrestricted population
s'²x = the variance of X in the range restricted sample
r'² = the squared observed correlation in the range restricted sample
r' = the observed correlation in the range restricted sample.


This correction can then be used, again under
parametric assumptions, to test H0 and derive
confidence intervals for rc. Importantly, while
SD(rc) is larger than SD(r) under conditions of
range restriction, SD(rc) does not increase relative
to SD(r) as fast as rc increases relative to r (Bobko,
1995). Hence, under parametric conditions the
investigator enjoys a “boost” in statistical power
when testing H0: ρ = 0. Current results suggest
this “boost” is not justified in the present data; the
likelihood of 0 falling in the confidence interval
seems to be about the same for both r and rc.

3. Given the apparent absence of bivariate normality
in the current data, tentative implications can also
be drawn for tests of H0: ρ − ρ0 = 0 and of
H0: RYX1X2 − RYX1 = 0. Specifically, parametric tests
of H0: ρ − ρ0 = 0 require use of Fisher’s z transformation.
In the presence of a constant effect size, the resultant
Z test (Bobko, 1995, p. 54) literally requires
double the sample size to attain the same statistical
power as a test of H0: ρ = 0. Again, the absence
of bivariate normality suggested by the current
results implies similar bootstrapping procedures
should be used to assess whether the 90% confidence
intervals for ρ − ρ0 and RYX1X2 − RYX1
contain 0.
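
The parametric “boost” discussed in the second recommendation can be made concrete with a short calculation. The sketch below assumes that the correction is the standard direct range restriction formula, rc = rK/√D with K = sx/s'x and D = 1 − r² + r²K², and that Equation 6 takes the corresponding first-order (delta-method) form SD(rc) ≈ (K/D^1.5)·SD(r). Both forms, the SD(r) = (1 − r²)/√(n − 1) approximation, and the function names are assumptions for illustration.

```python
import math


def corrected_r(r, k):
    """Assumed direct range restriction correction, with k = s_x / s'_x."""
    d = 1.0 - r**2 + (r**2) * (k**2)
    return r * k / math.sqrt(d)


def sd_corrected_r(r, sd_r, k):
    """Assumed delta-method SD of the corrected correlation."""
    d = 1.0 - r**2 + (r**2) * (k**2)
    return k / d**1.5 * sd_r


# Example inputs: values reported in the study, with n = 100 chosen for illustration.
r, n = 0.182, 100
k = 14.11 / 5.02
sd_r = (1.0 - r**2) / math.sqrt(n - 1)   # assumed parametric SD of r

r_c = corrected_r(r, k)
sd_r_c = sd_corrected_r(r, sd_r, k)
d = 1.0 - r**2 + (r**2) * (k**2)

print("r / SD(r)     =", round(r / sd_r, 3))
print("rc / SD(rc)   =", round(r_c / sd_r_c, 3))
print("boost factor D =", round(d, 3))
```

Under these assumptions the ratio of the corrected correlation to its standard deviation exceeds the uncorrected ratio by exactly the factor D, which is greater than 1 whenever K > 1 and r ≠ 0. This is the parametric “boost” that the bootstrap results do not appear to support in the present data.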

Overall, these results indicate that accurate estimation
of validity coefficients by bootstrap may be
technically feasible. However, two factors may limit
the practical application of the method at present.
First, current professional guidelines, standards,
principles, and practices in selection test validation are
based on traditional parametric statistics. Further
methodological research and empirical demonstrations
must be conducted to provide the technical
foundation for revising these professional canons.
Second, personnel selection tests and their validation
are subject to legal review. Statistical evidence in
employment discrimination litigation has probative
value only to the degree that the underlying theory,
model, and method are credible (Howard, 1994).
Bootstrap is not without controversy, and therefore
may not be viewed as credible in litigation. The
principal challenge may come from the Uniform
Guidelines admonition to “... avoid reliance upon
techniques which tend to overestimate a validity
finding as a result of capitalization on chance ... .” If
not carefully explained, the bootstrap may have the
appearance of exploiting chance to an unprecedented
degree. Further research is required to demonstrate
that the bootstrap neither leads to overestimates of
validity nor capitalizes on chance.

FIGURES

Figure 1: Frequency distribution of TMC scores in unrestricted (Applicant) sample:
N = 206,592 (X̄ = 75.20, σ = 14.11)

Figure 2: Frequency distribution of TMC scores in competitively hired ATCS
sample: N = 10,869 (X̄ = 91.46, σ = 5.02)


AIRCRAFT   ALTITUDE
10         7000
20         7000
30         7000
40         6500
50         6500
60         8000
70         8000

SAMPLE QUESTION
WHICH AIRCRAFT WILL CONFLICT?
A. 60 AND 70
B. 40 AND 70
C. 20 AND 30
D. NONE OF THESE

Figure 3: Example Multiplex Controller Aptitude Test (MCAT) item

Figure 4: Example Abstract Reasoning (ABSR) item

Figure 5: Flowchart of bootstrap analysis of simple,
uncorrected TMC-NLCOMP correlations

Figure 6: Flowchart of bootstrap analysis of TMC-NLCOMP
correlations with corrections for restriction in range

Figure 7: Flowchart of bootstrap analysis of correlations from
bivariate normal population where ρ = .182 and σ computed
by Equation 2 for bootstrap sample size of n

TABLE

Table 1

Estimates of 90% confidence interval and median for uncorrected and corrected TMC-NLCOMP
validity coefficients (rb) for B = 1,000 bootstrap samples of n = 25, 50, ..., 200

                          90% CI Boundaries and Median
n      Analysis           5%         50%        95%
25     Uncorrected      -0.251      0.207      0.587
       Corrected        -0.589      0.511      0.898
50     Uncorrected      -0.110      0.195      0.476
       Corrected        -0.298      0.487      0.836
75     Uncorrected      -0.072      0.186      0.423
       Corrected        -0.198      0.469      0.796
100    Uncorrected      -0.020      0.195      0.404
       Corrected        -0.055      0.488      0.779
125    Uncorrected      -0.008      0.187      0.377
       Corrected        -0.021      0.473      0.753
150    Uncorrected      -0.001      0.191      0.368
       Corrected        -0.003      0.481      0.744
175    Uncorrected       0.016      0.193      0.346
       Corrected         0.046      0.483      0.719
200    Uncorrected       0.034      0.190      0.332
       Corrected         0.094      0.477      0.703

APPENDIX A

Distributions of TMC-NLCOMP correlations uncorrected for range restriction
for B = 1,000 bootstrap samples of n = 25, 50, ..., 200

Graph A-1: Distribution of TMC-NLCOMP correlations uncorrected for
range restriction for B = 1,000 bootstrap samples of n = 25

Graph A-2: Distribution of TMC-NLCOMP correlations uncorrected for range
restriction for B = 1,000 bootstrap samples of n = 50
5% CI Lower: -0.110  Median: 0.195  95% CI Upper: 0.476

Graph A-3: Distribution of TMC-NLCOMP correlations uncorrected for range
restriction for B = 1,000 bootstrap samples of n = 75
5% CI Lower: -0.072  Median: 0.186  95% CI Upper: 0.423

Graph A-4: Distribution of TMC-NLCOMP correlations uncorrected for range
restriction for B = 1,000 bootstrap samples of n = 100
5% CI Lower: -0.020  Median: 0.195  95% CI Upper: 0.404

Graph A-5: Distribution of TMC-NLCOMP correlations uncorrected for range
restriction for B = 1,000 bootstrap samples of n = 125
5% CI Lower: -0.008  Median: 0.187  95% CI Upper: 0.377

Graph A-6: Distribution of TMC-NLCOMP correlations uncorrected for range
restriction for B = 1,000 bootstrap samples of n = 150
5% CI Lower: -0.001  Median: 0.191  95% CI Upper: 0.368

Graph A-7: Distribution of TMC-NLCOMP correlations uncorrected for range
restriction for B = 1,000 bootstrap samples of n = 175

Graph A-8: Distribution of TMC-NLCOMP correlations uncorrected for range
restriction for B = 1,000 bootstrap samples of n = 200

APPENDIX B

Distributions of TMC-NLCOMP correlations corrected for range restriction for
B = 1,000 bootstrap samples of n = 25, 50, ..., 200

Graph B-1: Distribution of TMC-NLCOMP correlations corrected for range
restriction for B = 1,000 bootstrap samples of n = 25
5% CI Lower: -0.589  Median: 0.511  95% CI Upper: 0.898

Graph B-2: Distribution of TMC-NLCOMP correlations corrected for range
restriction for B = 1,000 bootstrap samples of n = 50

Graph B-3: Distribution of TMC-NLCOMP correlations corrected for range restriction
for B = 1,000 bootstrap samples of n = 75

Graph B-4: Distribution of TMC-NLCOMP correlations corrected for range restriction
for B = 1,000 bootstrap samples of n = 100

Graph B-5: Distribution of TMC-NLCOMP correlations corrected for range restriction
for B = 1,000 bootstrap samples of n = 125

Graph B-6: Distribution of TMC-NLCOMP correlations corrected for range restriction
for B = 1,000 bootstrap samples of n = 150

Graph B-7: Distribution of TMC-NLCOMP correlations corrected for range restriction
for B = 1,000 bootstrap samples of n = 175

Graph B-8: Distribution of TMC-NLCOMP correlations corrected for range restriction
for B = 1,000 bootstrap samples of n = 200

APPENDIX C

Distributions of TMC-NLCOMP correlations generated for a bivariate normal population
with parameters ρ and sr from B = 1,000 bootstrap samples of n = 25, 50, ..., 200

Graph C-1: Distribution of TMC-NLCOMP correlations generated for a bivariate normal
population with parameters ρ and sr from B = 1,000 bootstrap samples of n = 25

Graph C-2: Distribution of TMC-NLCOMP correlations generated for a bivariate normal
population with parameters ρ and sr from B = 1,000 bootstrap samples of n = 50

Graph C-3: Distribution of TMC-NLCOMP correlations generated for a bivariate normal
population with parameters ρ and sr from B = 1,000 bootstrap samples of n = 75

Graph C-4: Distribution of TMC-NLCOMP correlations generated for a
bivariate normal population with parameters ρ and sr from B = 1,000
bootstrap samples of n = 100
5% CI Lower: -0.010  ρ = .182, sr = .0972  95% CI Upper: 0.360

Graph C-5: Distribution of TMC-NLCOMP correlations generated for a
bivariate normal population with parameters ρ and sr from B = 1,000
bootstrap samples of n = 125

Graph C-6: Distribution of TMC-NLCOMP correlations generated for a
bivariate normal population with parameters ρ and sr from B = 1,000
bootstrap samples of n = 150

Graph C-7: Distribution of TMC-NLCOMP correlations generated for a
bivariate normal population with parameters ρ and sr from B = 1,000
bootstrap samples of n = 175

Graph C-8: Distribution of TMC-NLCOMP correlations generated for
a bivariate normal population with parameters ρ and sr from B = 1,000
bootstrap samples of n = 200